This workshop will introduce you to working with the command line in the Bash shell and using basic UNIX commands to manipulate files and their contents.
We will go through the entire workshop together. The workshop consists of both lectures and hands-on exercises. I have prepared an HTML file that you can use as a guide. This file will serve as a reference during the workshop and for your future analyses.
Important: Please make sure to think critically and understand the
commands you are running, why you are running them, and what options are
available. Always use the man command or the
--help option to explore additional features of a
command.
Simply open a terminal window. You can do this by searching for “Terminal” in your system’s search bar. A command prompt window should appear, ready for use.
Download and install PuTTY from https://www.putty.org/. Use PuTTY to log in to your Amazon cloud instance.
Alternatively, you can use the pre-installed Windows PowerShell, which works much like the Linux command line. Note that many commands differ between PowerShell and Linux terminals; see the cheat sheet at https://blog.netwrix.com/powershell-commands-cheat-sheet for guidance.
To execute the command, type the name of the command at the prompt.
ls
## 0.5
## Calanus.sh
## E6609_SampleSheet.csv
## Pop-A-L-37_1.fastq.gz
## Pop-A-L-37_2.fastq.gz
## Pop-A-L-7_1.fastq.gz
## Pop-A-L-7_2.fastq.gz
## Pop-A-S-20_1.fastq.gz
## Pop-A-S-20_2.fastq.gz
## PopA-S-1_1.fastq.gz
## PopA-S-1_2.fastq.gz
## PopC-L-11_1.fastq.gz
## PopC-L-11_2.fastq.gz
## PopC-L-2_1.fastq.gz
## PopC-L-2_2.fastq.gz
## PopC-S-16_1.fastq.gz
## PopC-S-16_2.fastq.gz
## PopC-S-4_1.fastq.gz
## PopC-S-4_2.fastq.gz
## Rmarkdown_tutorials.Rmd
## Rmarkdown_tutorials.html
## Rplot.pdf
## Rplot01_pca_plot_8_indv_square.png
## Rplot01_pca_plot_square.pdf
## Rplot_.frq.png
## Rplot_PCA.png
## Rplot_l_depth.png
## Rplot_lmiss.png
## Rplot_lqual.png
## Rplot_milkfish_pca_8_indv.png
## Rplot_milkifish_8_indv.pdf
## bcl2fastqshell.sh
## big_data.fastq
## extracted_contigs.fastq
## extracted_contigs.fastq.gz
## fastqc_html
## full_filter_nomaf_milkfish_snps_filtered.vcf.gz
## hello.sh
## html_002.html
## html_00_Data_&_Software_Installation.Rmd
## html_00_Data_Software_Installation.html
## html_02_Bash_scripting.Rmd
## html_02_Bash_scripting.html
## html_03.Rmd
## html_03.html
## html_03.log
## html_03.tex
## html_ppt_04.html
## html_ppt_05.Rmd
## loop.sh
## milkfish_edited.sh
## milkfish_filtered.frq
## milkfish_filtered.het
## milkfish_filtered.idepth
## milkfish_filtered.imiss
## milkfish_filtered.ldepth.mean
## milkfish_filtered.lmiss
## milkfish_filtered.log
## milkfish_filtered.lqual
## milkfish_pca_nomaf.eigenval
## milkfish_pca_nomaf.eigenvec
## milkfish_snps_filtered.vcf.gz
## names_sorted.txt
## newfile.txt
## participants.txt
## pca
## rsconnect
## seqtk
## session1_file1.txt
## session1_file2.txt
ls is short for list: it lists the names of all the files and folders
contained in the current directory.
If you type:
man ls
You will bring up the manual for that command. Please note that you
can use man with almost all basic executable shell commands,
or alternatively use --help to output basic
information about the command.
Navigation - where am I?
pwd
## /Users/apollo/Documents/Popgen_Workshop/Day_001
pwd stands for print working directory. It outputs your
current working directory, that is, your current location in the file
system. This is important whenever you need to know the path to your
input or output files, or to software you are going to use that you
have not put in one of the system's directories
(we will discuss this later on).
Make a directory
mkdir test
A new test folder is created.
Now try to remove the folder.
What code are you going to use?
rm -r test
To delete EMPTY directories from the system, you can use rmdir (remove directory) command.
rmdir DIRECTORY
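The difference between rmdir and rm -r can be checked safely in a scratch directory. This is only a minimal sketch: the temporary directory created by mktemp and the names demo_dir and notes.txt are made up for illustration and are not part of the workshop data.

```bash
# Work in a throwaway directory so that no workshop files are touched.
scratch=$(mktemp -d)
mkdir "$scratch/demo_dir"
touch "$scratch/demo_dir/notes.txt"

# rmdir refuses to delete a directory that still contains files.
if rmdir "$scratch/demo_dir" 2>/dev/null; then
    rmdir_result="removed"
else
    rmdir_result="refused"
fi
echo "rmdir on a non-empty directory: $rmdir_result"

# rm -r deletes the directory together with everything inside it.
rm -r "$scratch/demo_dir"
[ -d "$scratch/demo_dir" ] || echo "rm -r removed demo_dir"
```

Because rm -r deletes without asking and there is no recycle bin, it is worth running ls on the target first.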
To make an empty file you can use touch
touch newfile.txt
To delete a file, use the rm command:
rm newfile.txt
To copy a file, use the cp command.
You have to specify the path to the file you want to copy and its destination:
cp SOURCE DESTINATION
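Let's say I want you to copy the folder 0.5 to your home
directory:
cp -r 0.5 /Users/apollo
or use ~, which is a shortcut for your home directory:
cp -r 0.5 ~
Leave off the trailing slash after 0.5: on macOS, cp -r 0.5/ copies the contents of the folder rather than the folder itself.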
Just type cp, the name of the file you want to copy, and the
folder you want to copy it to. Be careful about where you save the
copy: cp overwrites an existing file with the same name without warning.
Let's say I want to copy newfile.txt into the 0.5/ directory.
I simply type:
cp newfile.txt 0.5/
Then type ls 0.5/
to check that the file I copied is in the 0.5/ folder.
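The two forms of cp described above (one file, and a whole folder with -r) can be tried out in a scratch directory. The directory and file names below (data, backup, notes.txt, data_copy) are assumptions chosen for the sketch.

```bash
# Throwaway directory with a small folder to copy from.
scratch=$(mktemp -d)
mkdir "$scratch/data" "$scratch/backup"
echo "sample text" > "$scratch/data/notes.txt"

# Copy a single file into another directory.
cp "$scratch/data/notes.txt" "$scratch/backup/"

# Copy a whole directory and its contents with -r.
# data_copy does not exist yet, so it becomes the name of the copy.
cp -r "$scratch/data" "$scratch/backup/data_copy"

# List everything that ended up in the backup folder.
ls -R "$scratch/backup"
```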
Every file on the system has a set of permissions that determine who can read, change or delete, or execute the file.
By default, files you create in your account are
readable and changeable (writable) by you; on most systems they are
not executable until you add that permission.
Other files, not created by you, may have different permissions.
To see the permission settings for a file or files in a folder, we can use the ls command as follows:
ls -l
or
ls -l filename
Now check the permissions of each of the files saved in your directory. What do they say?
File permissions are split into groups of three characters, and each position
in a group denotes a specific permission, in this order:
read (r), write (w), execute (x), i.e. rwx. A dash (-) in a position means that permission is not granted. The very first character of the string shows the file type (- for a regular file, d for a directory).
The first group of three (characters 2-4) represents the permissions for the file's owner. In -rwxr-xr--, the owner has read (r), write (w) and execute (x) permission.
The second group of three (characters 5-7) holds the permissions for the group to which the file belongs. In -rwxr-xr--, the group has read (r) and execute (x) permission, but no write permission.
The last group (characters 8-10) represents the permissions for everyone else. In -rwxr-xr--, everyone else has read (r) permission only.
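Sometimes you will need to change the permissions on files in order to execute them.
The command to change permissions is chmod.
You have to specify whose permissions you are modifying, what the new permissions are, and which file or directory to act on.
For example, you can quickly change file permissions using numeric codes, where each digit is the sum of read (4), write (2) and execute (1) for the owner, the group and everyone else, in that order. The code 777 gives all three permissions to everybody, which makes the file executable (important for scripts) but also lets any user modify it; a more restrictive code such as 755 is usually enough.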
chmod 777 filename
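As a gentler alternative to 777, the sketch below sets permissions with a numeric code and then adds execute permission for the owner only. The file name demo.sh and the scratch directory are assumptions for illustration; cut -c1-10 simply keeps the ten permission characters from the ls -l output.

```bash
# Create a small script file in a throwaway directory.
scratch=$(mktemp -d)
script="$scratch/demo.sh"
printf '#!/bin/bash\necho "it runs"\n' > "$script"

# Numeric mode: 6 = read + write (4 + 2), 4 = read only.
chmod 644 "$script"
perms_before=$(ls -l "$script" | cut -c1-10)

# Symbolic mode: add execute (x) permission for the owner (u) only.
chmod u+x "$script"
perms_after=$(ls -l "$script" | cut -c1-10)

echo "$perms_before -> $perms_after"
# prints: -rw-r--r-- -> -rwxr--r--
```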
There are many ways to view a text file. One way for simple viewing is to type:
less filename
Use less to look at a big file in your directory. Try to view
big_data.fastq:
less big_data.fastq
The less command displays as much of the file as can fit onto the screen. To scroll up and down within the document, use the arrow keys. Hitting the space bar will bring a new screen-full of information.
To search forward in the file for a given pattern, type
/ followed by the pattern and press Enter; press n to jump to the
next match. For example, to look for sequences that contain
AAACCCGGGTTT, type /AAACCCGGGTTT and press Enter. (Inside less,
Ctrl + F scrolls forward one screen instead of searching.)
Remember that you need to be inside less to use
this command.
To exit the less program and return to the prompt, press:
q
Viewing part of the file
The head command displays the first few lines at the top of a file.
The switch -n allows you to specify how many lines to
display, starting from the first line
head -n 10 big_data.fastq
## @LH00684:49:22JCVJLT4:7:1101:2189:1042 1:N:0:CATGAGCA
## CNTATATAGACTCTCACACTCACACACATGTACACTTACACACACACACAGAAAGAAAGAGAGAGAGAGAAATTTTATTCAGATCAATGATGCTCTAGATCGGAAGAGCACACGTCTGAACTCCAGTCACCATGAGCAATCTGGGGGGGTG
## +
## I#I9IIIIIIIIIIIIIIIIIIIIIIII9II9IIIIIIIIIIIIIIIII99IIIIII9III9I9II99IIIIIIII9IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII9IIIIIIII999II99I9III*9*99*9*I**
## @LH00684:49:22JCVJLT4:7:1101:4422:1042 1:N:0:CATGAGCA
## GNGTACTTTTCACATACTACTCTACGCATACTATGTATTTGGACGTACTACTAATGGTGGCGTACTGTTTTGACGTACTATTTAGTACGTTAGTATGCGGGTTTGAGATCGGAAGAGCACACGTCTGAACTCCAGTCACCATGAGCAATCT
## +
## I#IIIIIIIIIIIIIIIIIIIIIIIIIIII9IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII99
## @LH00684:49:22JCVJLT4:7:1101:5231:1042 1:N:0:CATGAGCA
## CNATTACTGGCTTTTCATTCCTTTGGCTGATTTATGATGTTTAAACAGATTGATGGTCAAGAAAATCTGTTACGGCTCTAGAGTCCATCCCTGTCAGAGCTACAAGAAACACATCAGATCGGAAGAGCACACGTCTGAACTCCAGTCACCA
Do you see similar output on your screens?
If you specify a negative number N, all lines except the last N are displayed. Negative counts are supported by GNU head (the default on Linux); the head shipped with macOS may reject them.
Print everything except the last line:
head -n -1 big_data.fastq
Try!
The tail command displays the last few lines of a file. By default tail will show the last ten lines of a file, but you can tell it how many lines to display.
tail -n 10 big_data.fastq
## +
## II#IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII9IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII9III*IIIIIIII9IIIIIIIIIIIIIIII
## @LH00684:49:22JCVJLT4:7:2498:49500:29767 1:N:0:CATGAGCA
## AANGGCAAGCTGTAAGTAGCAGGTATCACAACTGGTGGGGTTCATGGTTGTCTTCAATTAACCGAACATAGAGGGATGAGCTGGATCACGGCTGGCAGGGATCACAGTTCTGCTCAGCAGGAGATCGGAAGAGCACACGTCTGAACTCCAG
## +
## II#IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII9IIIIIIIIIIIIIIIIIIIIIIIII
## @LH00684:49:22JCVJLT4:7:2498:51813:29767 1:N:0:CATGAGCA
## TGNTGTCTGACTAGGTCAGCAAGGGTCAGAGGAATCGGAGCGTCAGGACACTGGCCTTTCTTTCCTAATCCTGATAATTGCTTAGAGAGAGCCGATGTCTCACAACCTCCCGAAATGTCTTTCATTGTCTCCTCTCTAACAAGACCTGGGG
## +
## II#IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII9I9IIIIIIIIIIIII*99IIII99IIII9II9IIIIII9IIII9IIII*IIIII9I9I9I9*9IIII*IIII9II99IIIII9I9I*99III9IIII9I9I*
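This shows the last ten lines of the file.
You can also use the -f option, which keeps the file open and prints new
data to the screen as it is appended. This is very useful for tracking the
progress of software that writes a log file, e.g. MrBayes. Press Ctrl + C to stop following the file.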
tail -f -n 10 big_data.fastq
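Because each FASTQ record occupies four lines, head and tail can pull out whole records. The sketch below uses a tiny made-up FASTQ file (tiny.fastq, with invented reads) and assumes the usual four-lines-per-record layout.

```bash
# Build a two-record FASTQ file in a throwaway directory.
scratch=$(mktemp -d)
fq="$scratch/tiny.fastq"
printf '%s\n' '@read1' 'ACGT' '+' 'IIII' '@read2' 'TTGA' '+' 'I9II' > "$fq"

# The first record is the first 4 lines of the file.
first_record=$(head -n 4 "$fq")

# The last record is the last 4 lines of the file.
last_record=$(tail -n 4 "$fq")

# tail -n +N starts printing at line N instead of counting from the end.
from_line_5=$(tail -n +5 "$fq")

echo "$first_record"
echo "$last_record"
```

In this two-record file, tail -n 4 and tail -n +5 return the same lines.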
paste is a handy command that can do several useful things. For example, it can transpose data from a single column into a single row, delimited by whatever you like (tabs, commas, spaces, etc.).
The -s option used here transposes the contents of session1_file1.txt into a single line.
First, let's view the file using less:
less session1_file1.txt
## Mia
## Felix
## KC
## Inggat
## TJ
## Errol
## Belay
## Tabards
## Gela
## Anj
## Raf
## Tin
## Aye
## Janelle
## Apollo
## LeQin
## Syd
## Reina
## Rachel
It should be a single column
paste -s session1_file1.txt
## Mia Felix KC Inggat TJ Errol Belay Tabards Gela Anj Raf Tin Aye Janelle Apollo LeQin Syd Reina Rachel
Try specifying a delimiter, in this case “,”:
paste -s -d"," session1_file1.txt
## Mia ,Felix ,KC ,Inggat ,TJ ,Errol ,Belay ,Tabards ,Gela ,Anj ,Raf ,Tin ,Aye ,Janelle ,Apollo ,LeQin ,Syd ,Reina ,Rachel
paste can also intersperse all the rows of two files into a single combined file.
Have a look at session1_file1.txt and session1_file2.txt using
less or cat
paste -d"\n" session1_file1.txt session1_file2.txt > participants.txt
Now view the file participants.txt
What do you see?
Challenge 1: Can you paste the names and surnames
together side by side?
This is really useful for all sorts of file-formatting tasks, e.g. turning a list of SNP identifiers (headers) with one identifier per genotype into one identifier per allele.
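One way to picture the SNP identifier case is to interleave a file with itself, so that every identifier appears twice. This is a sketch under assumptions: the file names snp_ids.txt and allele_ids.txt and the three identifiers are invented, and a real data set may need allele-specific suffixes as well.

```bash
# One identifier per genotype, in a throwaway directory.
scratch=$(mktemp -d)
printf '%s\n' 'snp_001' 'snp_002' 'snp_003' > "$scratch/snp_ids.txt"

# Giving paste the same file twice with a newline delimiter interleaves the
# file with itself, so each identifier lands on two consecutive lines.
paste -d '\n' "$scratch/snp_ids.txt" "$scratch/snp_ids.txt" > "$scratch/allele_ids.txt"

cat "$scratch/allele_ids.txt"
```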
cat session1_file1.txt > newfile.txt
This creates a new file (newfile.txt) with the same contents as the old file (session1_file1.txt). If newfile.txt already exists, > overwrites it.
Now try:
cat session1_file1.txt >> session1_file2.txt
This appends the contents of file1 to file2. It is equivalent to opening file1, copying all of its contents, pasting them at the end of file2 and saving it.
cat session1_file1.txt session1_file1.txt session1_file1.txt > newfile2.txt
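This concatenates the listed input files, in order, into newfile2.txt. Here the same file is listed three times, so its contents appear three times in the output.
Note: if you have lots of files and each of them contains a single FASTA sequence,
you can combine them all into a single file, “sequences.fasta”, using redirects.
cat *.fas >> sequences.fasta
This combines all .fas files into a single file. Because >> appends, running the command twice adds every sequence twice; use > to start from an empty file.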
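The difference between the two redirects is easiest to see on a tiny file. In this sketch, out.txt is a made-up file in a scratch directory.

```bash
scratch=$(mktemp -d)
out="$scratch/out.txt"

echo "first"  > "$out"    # > creates the file (or replaces its contents)
echo "second" > "$out"    # > again: "first" is overwritten
echo "third" >> "$out"    # >> appends to the end of the file

cat "$out"
# prints:
# second
# third
```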
The split command splits a file into a series of smaller files.
The content of the input file is split into ordered files named with the prefix “x” (xaa, xab, ...), unless another prefix is provided as an argument to the command.
The switch -l lets you specify how many lines to include
in each file.
Split the content into separate files of 3 lines each, written to new files whose names start with the prefix small:
split -l 3 session1_file1.txt small
What happened? Can you tell us?
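The naming scheme is easier to see on a short file. This sketch splits a made-up seven-line file (numbers.txt) with the assumed prefix part_ inside a scratch directory.

```bash
scratch=$(mktemp -d)
printf '%s\n' one two three four five six seven > "$scratch/numbers.txt"

# Split into pieces of 3 lines each; output names start with the given prefix.
split -l 3 "$scratch/numbers.txt" "$scratch/part_"

# 7 lines give part_aa (3 lines), part_ab (3 lines) and part_ac (1 line).
ls "$scratch"
wc -l "$scratch"/part_*
```

Joining the pieces back together with cat "$scratch"/part_* reproduces the original file, because the suffixes sort in the order the pieces were written.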
The grep command prints the lines of the input file that match the given pattern(s).
It is very useful for file manipulation, and excellent for searching for a particular
pattern in a file and sending the results to the screen or to a file.
Print lines that match “Apollo”:
grep Apollo session1_file1.txt
## Apollo
“-v” performs a “reverse-matching” and prints only the lines that do not match the pattern.
grep -v Apollo session1_file1.txt
## Mia
## Felix
## KC
## Inggat
## TJ
## Errol
## Belay
## Tabards
## Gela
## Anj
## Raf
## Tin
## Aye
## Janelle
## LeQin
## Syd
## Reina
## Rachel
“-i” specifies a case-insensitive match (by default this command is case sensitive).
grep -i apollo session1_file1.txt
## Apollo
Everyone with a lowercase a in their name:
grep 'a' session1_file1.txt
## Mia
## Inggat
## Belay
## Tabards
## Gela
## Raf
## Janelle
## Reina
## Rachel
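Count the sequences in a FASTA file, where every record header starts with >:
grep -c "^>" sequences.fasta
Note that big_data.fastq is a FASTQ file, whose record headers start with @ instead, so the pattern "^>" does not count its reads.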
Breathe! This may take a while :)
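For FASTQ files, counting lines that start with @ can overcount, because a quality line may also begin with @. A hedged alternative, assuming the standard layout of exactly four lines per record, is to divide the line count by four. The file tiny.fastq and its reads are invented for this sketch.

```bash
scratch=$(mktemp -d)
fq="$scratch/tiny.fastq"
# Two records; the second quality line deliberately starts with "@".
printf '%s\n' '@read1' 'ACGT' '+' 'IIII' '@read2' 'TTGA' '+' '@III' > "$fq"

# Counting lines that start with "@" overcounts here (3 instead of 2).
at_count=$(grep -c '^@' "$fq")

# Dividing the total number of lines by 4 gives the number of records.
total_lines=$(wc -l < "$fq")
record_count=$((total_lines / 4))

echo "lines starting with @: $at_count"
echo "records: $record_count"
```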
awk is very useful
Extract a column from a data file
awk '{ print $2 }' big_data.fastq > column2_big_data.fastq
Extract selected columns from a data file
awk '{ print $7, $8, $9 }' big_data.fastq
Or use cut:
cut -f3-5 big_data.fastq
Now compare the results of the awk and cut
commands. What did you observe?
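One difference worth knowing when comparing the two tools is how they decide where a column ends. The sketch below uses two one-line files with made-up contents (spaces.txt and tabs.txt) to show it.

```bash
scratch=$(mktemp -d)
printf 'id name score\n' > "$scratch/spaces.txt"       # separated by spaces
printf 'id\tname\tscore\n' > "$scratch/tabs.txt"       # separated by tabs

# awk splits on any run of spaces or tabs, so both files give "name".
awk_spaces=$(awk '{ print $2 }' "$scratch/spaces.txt")
awk_tabs=$(awk '{ print $2 }' "$scratch/tabs.txt")

# cut splits on a tab by default; a line without any tab is printed whole.
cut_spaces=$(cut -f2 "$scratch/spaces.txt")
cut_tabs=$(cut -f2 "$scratch/tabs.txt")

# With -d, cut can use another single-character delimiter.
cut_space_delim=$(cut -d ' ' -f2 "$scratch/spaces.txt")

echo "awk: $awk_spaces / $awk_tabs"
echo "cut: $cut_spaces / $cut_tabs / $cut_space_delim"
```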
In bash you can pipe the output from one command to
another using the | symbol.
For example
ls -l | grep '\.txt$'
## -rw-r--r--@ 1 apollo staff 120 Apr 2 15:54 names_sorted.txt
## -rw-r--r--@ 1 apollo staff 120 Mar 15 11:38 newfile.txt
## -rw-r--r--@ 1 apollo staff 301 Apr 2 15:55 participants.txt
## -rw-r--r-- 1 apollo staff 120 Mar 15 11:24 session1_file1.txt
## -rw-r--r-- 1 apollo staff 181 Mar 15 11:31 session1_file2.txt
Here the output of ls -l is sent to the grep
program, which in turn prints the lines that match the regex
\.txt$ (that is, names ending in .txt).
To find .gz files use
ls -l | grep '\.gz$'
## -rwxr-xr-x 1 apollo staff 4042378619 Feb 16 11:34 Pop-A-L-37_1.fastq.gz
## -rwxr-xr-x 1 apollo staff 4295840551 Feb 16 11:34 Pop-A-L-37_2.fastq.gz
## -rwxr-xr-x 1 apollo staff 3281988085 Feb 16 11:35 Pop-A-L-7_1.fastq.gz
## -rwxr-xr-x 1 apollo staff 3470062588 Feb 16 11:35 Pop-A-L-7_2.fastq.gz
## -rwxr-xr-x 1 apollo staff 3850705868 Feb 16 11:25 Pop-A-S-20_1.fastq.gz
## -rwxr-xr-x 1 apollo staff 4093409300 Feb 16 11:25 Pop-A-S-20_2.fastq.gz
## -rwxr-xr-x 1 apollo staff 5336438506 Feb 16 11:24 PopA-S-1_1.fastq.gz
## -rwxr-xr-x 1 apollo staff 5539019476 Feb 16 11:24 PopA-S-1_2.fastq.gz
## -rwxr-xr-x 1 apollo staff 1892665417 Feb 16 11:36 PopC-L-11_1.fastq.gz
## -rwxr-xr-x 1 apollo staff 2049544307 Feb 16 11:36 PopC-L-11_2.fastq.gz
## -rwxr-xr-x 1 apollo staff 1451179507 Feb 16 11:37 PopC-L-2_1.fastq.gz
## -rwxr-xr-x 1 apollo staff 1520890252 Feb 16 11:37 PopC-L-2_2.fastq.gz
## -rwxr-xr-x 1 apollo staff 3275253918 Feb 16 11:35 PopC-S-16_1.fastq.gz
## -rwxr-xr-x 1 apollo staff 3394558311 Feb 16 11:35 PopC-S-16_2.fastq.gz
## -rwxr-xr-x 1 apollo staff 2082049955 Feb 16 11:35 PopC-S-4_1.fastq.gz
## -rwxr-xr-x 1 apollo staff 2197770802 Feb 16 11:35 PopC-S-4_2.fastq.gz
## -rw-r--r--@ 1 apollo staff 0 Mar 16 11:27 extracted_contigs.fastq.gz
## -rw-r--r-- 1 apollo staff 3733738 Mar 30 20:21 full_filter_nomaf_milkfish_snps_filtered.vcf.gz
## -rw-r--r-- 1 apollo staff 424144608 Mar 25 11:51 milkfish_snps_filtered.vcf.gz
This lists all the .gz files in your directory, together with
information about their permissions.
UNIX lets you combine virtually any commands with the pipe symbol
|.
The output of the first command is used as input for the next command.
Combining grep and wc will give you the
number of lines containing a particular pattern:
grep 'a' session1_file1.txt | wc -l
(You can also count with the -c option of grep, but here we are illustrating how to combine commands.)
This may take a while…patience is a virtue :)
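The same idea can be tried on a short list first. In this sketch, names.txt is a made-up five-line file holding a few of the names used earlier; the last command chains three programs together.

```bash
scratch=$(mktemp -d)
printf '%s\n' Mia Felix Gela Raf Apollo > "$scratch/names.txt"

# grep prints the matching lines; wc -l counts the lines it receives.
piped_count=$(grep 'a' "$scratch/names.txt" | wc -l)

# grep -c does the counting itself.
direct_count=$(grep -c 'a' "$scratch/names.txt")

# Pipes can be chained: matching lines, then sorted, then only the first line.
first_sorted=$(grep 'a' "$scratch/names.txt" | sort | head -n 1)

echo "piped: $piped_count, direct: $direct_count, first: $first_sorted"
```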
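Sorting the rows of a tabular file can be useful for digesting large data outputs. It is also very useful for sorting lists of p-values, FST values, etc.
By default, sort orders the lines alphabetically, comparing each line from its first character onwards, so a file is effectively sorted on its first column.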
sort session1_file1.txt > names_sorted.txt
Check the names_sorted.txt file
What happened?
If you want to sort a file based on a different column, use the -k option
sort -k 2 session1_file1.txt > names_sorted_2.txt
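Compare names_sorted_2.txt with names_sorted.txt. Is there any difference? Why? (Hint: session1_file1.txt has only one column.)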
sort options
-r reverse sort
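The -k option becomes meaningful once a file has more than one column. The sketch below sorts a made-up two-column table (fst.txt, with invented locus names and values) on its second column; LC_ALL=C is set only so that the decimal point is interpreted the same way on every system.

```bash
scratch=$(mktemp -d)
printf '%s\n' 'locusA 0.12' 'locusB 0.03' 'locusC 0.45' > "$scratch/fst.txt"

# -k 2,2 sorts on the second whitespace-separated column only;
# -n compares it as a number, -r puts the largest value first.
LC_ALL=C sort -k 2,2 -n -r "$scratch/fst.txt" > "$scratch/fst_sorted.txt"

cat "$scratch/fst_sorted.txt"
# prints:
# locusC 0.45
# locusA 0.12
# locusB 0.03
```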
tar vs gzip: tar assembles files together, gzip compresses them.
The tar command is used to create .tar.gz or .tgz archive files, also called “tarballs.”
This command has a large number of options, but you just need to remember a few letters to quickly create archives with tar.
The tar command can extract the resulting archives, too.
tar -czvf name-of-archive.tar.gz /path/to/directory-or-file
-c: Create an archive.
-z: Compress the archive with gzip.
-v: Display progress in the terminal while creating the archive, also known as “verbose” mode.
-f: Allows you to specify the filename of the archive.
tar -czvf test.tar.gz *.fas
Once you have an archive, you can also extract it with the
tar command.
The following command will extract the contents of
archive.tar.gz to the current directory.
tar -xzvf archive.tar.gz
The -x switch replaces the -c switch. This specifies that you want to extract an archive instead of creating one.
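A full round trip (create, list, extract) can be rehearsed on two tiny files. The directory names project and restore and the .fas files are assumptions for this sketch; -t (list) and -C (change directory) are two further tar options not covered above.

```bash
scratch=$(mktemp -d)
mkdir "$scratch/project" "$scratch/restore"
echo "ACGT" > "$scratch/project/seq1.fas"
echo "TTGA" > "$scratch/project/seq2.fas"

# -C changes into the given directory first, so the archive stores
# the relative path project/... instead of a long absolute path.
tar -czf "$scratch/project.tar.gz" -C "$scratch" project

# -t lists the contents of an archive without extracting anything.
tar -tzf "$scratch/project.tar.gz"

# Extract into a different directory, again using -C.
tar -xzf "$scratch/project.tar.gz" -C "$scratch/restore"
ls "$scratch/restore/project"
```

Listing an archive with -t before extracting it is a cheap way to see where the files will end up.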
How to create a basic shell script?
A shell script is a text file with the following format:
All bash scripts start with the line #!/bin/bash,
followed either by further comment lines that start with # or
directly by commands.
You can create the file with touch and edit it in nano,
saving it with a .sh extension.
You will have to use chmod (for example chmod 777) to make it executable.
Then…
Try this simple shell script.
First type nano, then press Enter.
Then type #!/bin/bash on the first line.
On the next line, type the command echo Hello World!
Press Ctrl + X, then
Y to save, name the file hello.sh, then press Enter.
Now, change the permission of the file by typing:
chmod 777 hello.sh
In many terminals, ls then shows the file name in green, indicating that it is now an executable file.
Now run the shell script that you just made
sh hello.sh
## Hello World!
Did you get the same results?
You can try creating longer scripts that do more complex tasks or
loop through processes. Try this basic for
loop.
Looping is very important, especially if you have many files and would like to automate your work. Writing loops in Linux is much simpler than you might think.
Try this!
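A basic for loop might look like the sketch below. The list of names and the counter variable are illustrative assumptions, not a fixed part of the exercise.

```bash
# The variable "name" takes each value in the list in turn,
# and the commands between do and done run once per value.
count=0
for name in Mia Felix KC; do
    count=$((count + 1))
    echo "$count: Hello, $name!"
done
echo "Greeted $count people"
```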
What happened? What did it do?
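This time let's practice on a real .fastq file.
Let's extract all reads that contain the sequence
AAAGGGCCCTTT from the big_data.fastq file. The output is plain text, so it is saved as .fastq rather than .fastq.gz:
for file in big_data.fastq
do
# -B 1 and -A 2 also print the line before and the two lines after each match
grep -B 1 -A 2 "AAAGGGCCCTTT" "$file"
done > extracted_contigs.fastq
Then count the number of extracted reads (grep writes a -- separator line between non-adjacent matches, which does not affect this count):
grep -c "^@" extracted_contigs.fastq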
This may take a while!
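An alternative that always writes complete four-line records, without grep's separator lines, is to let awk remember each record and print it only when the sequence contains the motif. This is a sketch under the assumption of four lines per record; tiny.fastq, its reads and the output name extracted.fastq are made up.

```bash
scratch=$(mktemp -d)
fq="$scratch/tiny.fastq"
printf '%s\n' '@read1' 'TTAAAGGGCCCTTTGA' '+' 'IIIIIIIIIIIIIIII' \
              '@read2' 'ACGTACGTACGTACGT' '+' 'IIIIIIIIIIIIIIII' > "$fq"

# Store lines 1-3 of each record; when the quality line (line 4) arrives,
# print the whole record if the stored sequence contains the motif.
awk -v motif="AAAGGGCCCTTT" '
    NR % 4 == 1 { header = $0 }
    NR % 4 == 2 { seq = $0 }
    NR % 4 == 3 { plus = $0 }
    NR % 4 == 0 && index(seq, motif) > 0 {
        print header; print seq; print plus; print $0
    }
' "$fq" > "$scratch/extracted.fastq"

cat "$scratch/extracted.fastq"
```

Only read1 contains the motif here, so the output holds exactly one record.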
Example Output:
file1.txt: 150 words
file2.txt: 98 words
file3.txt: 200 words
Hints:
Use a for loop to iterate over .txt files.
Use wc -w to count words.
Use echo to print the results.
Answer?
#!/bin/bash
for file in *.txt; do
word_count=$(wc -w < "$file")
echo "$file: $word_count words"
done
Who got this answer? or a different answer?
Now I want you to solve this problem set using any AI tool: rename all the
.fastq.gz files in the directory so that each one becomes
processed_$file (where $file is the name of the .fastq.gz file).
Answer?
#!/bin/bash
for file in *.fastq.gz; do
mv "$file" "processed_$file"
done
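Before renaming real sequencing files, the loop can be rehearsed on empty placeholder files. The sketch below adds two hedged safety checks that are not part of the answer above: it skips the case where no file matches, and it skips files that already carry the prefix. The sample file names are invented.

```bash
scratch=$(mktemp -d)
touch "$scratch/sampleA.fastq.gz" "$scratch/sampleB.fastq.gz"

for path in "$scratch"/*.fastq.gz; do
    # If nothing matches, the pattern itself is left as-is; skip it.
    [ -e "$path" ] || continue
    dir=$(dirname "$path")
    file=$(basename "$path")
    # Skip files that already have the prefix, so re-running is harmless.
    case "$file" in
        processed_*) continue ;;
    esac
    mv "$path" "$dir/processed_$file"
done

ls "$scratch"
```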
Interpreted language – quick to program
Easy to learn compared to most languages
Designed for working with text files
Free for all operating systems
Most popular language in bioinformatics – many scripts available you can “borrow”, also ready made modules.
Many users are now using Python, which is more powerful than Perl.
Better syntax
Used beyond bioinformatics
Fewer ready made scripts available online.
Many libraries available for bioinformatics, e.g. Biopython.